Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages

نویسندگان

  • David W. Embley
  • Douglas M. Campbell
  • Y. S. Jiang
  • Stephen W. Liddle
  • Yiu-Kai Ng
  • Dallan Quass
  • Randy D. Smith
چکیده

Electronically available data on the Web is exploding at an ever increasing pace. Much of this data is unstructured, which makes searching hard and traditional database querying impossible. Many Web documents, however, contain an abundance of recognizable constants that together describe the essence of a document’s content. For these kinds of data-rich, multiple-record documents (e.g. advertisements, movie reviews, weather reports, travel information, sports summaries, financial statements, obituaries, and many others) we can apply a conceptual-modeling approach to extract and structure data automatically. The approach is based on an ontology—a conceptual model instance—that describes the data of interest, including relationships, lexical appearance, and context keywords. By parsing the ontology, we can automatically produce a database scheme and recognizers for constants and keywords, and then invoke routines to recognize and extract data from unstructured documents and structure it according to the generated database scheme. Experiments show that it is possible to achieve good recall and precision ratios for documents that are rich in recognizable constants and narrow in ontological breadth. Our approach is less labor-intensive than other approaches that manually or semiautomatically generate wrappers, and it is generally insensitive to changes in Web-page format.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. In our extraction algorithm, we inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

متن کامل

OLERA: On-Line Extraction Rule Analysis for Semi-structured Documents

The vast amount of online information available has led to renewed interest in information extraction (IE) systems that analyze input documents to produce a structured representation of selected information from the documents. Information extraction from semistructured documents has been studied extensively recently. Most researches focus on supervised learning approaches where targets must be ...

متن کامل

Automatic Information Extraction for Multiple Singular Web Pages

TheWorld WideWeb is now undeniably the richest and most dense source of information, yet its structure makes it diÆcult to make use of that information in a systematic way. This paper extends a pattern discovery approach called IEPAD to the rapid generation of information extractors that can extract structured data from semi-structured Web documents. IEPAD is proposed to automate wrapper genera...

متن کامل

Taxonomy Based Data Extraction from Multi-item Web Pages

In this paper, we present a taxonomy-driven approach to the extraction of data records from web pages containing multiple similar items. In our approach, we first automatically extract a taxonomy from the web for the target domain, then extract data records from web pages in the target domain, and finally use the automatically-extracted taxonomy to automatically annotate feature values in the t...

متن کامل

Sequential Pattern Mining for Web Extraction Rule Generalization

Information extraction (IE) is an important problem for information integration with broad applications. It is an attractive application for machine learning. The core of this problem is to learn extraction rules from given input. This paper extends a pattern discovery approach called IEPAD to the rapid generation of information extractors that can extract structured data from semi-structuredWe...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Data Knowl. Eng.

دوره 31  شماره 

صفحات  -

تاریخ انتشار 1999